Querying Very Large Multi-dimensional Datasets in ADR - Extended Abstract

Authors

  • Tahsin Kurc
  • Chialin Chang
  • Renato Ferreira
  • Alan Sussman
  • Joel Saltz

Abstract

Analysis and processing of very large multi-dimensional scientific datasets (i.e., datasets in which data items are associated with points in a multi-dimensional attribute space) is an important component of science and engineering, and an increasing number of applications make use of such datasets. Examples include raw and processed sensor data from satellites [12], output from hydrodynamics and chemical transport simulations [10], and archives of medical images [1]. Many applications that make use of multi-dimensional datasets share several important characteristics. Both the input and the output are often disk-resident datasets. Applications may use only a subset of all the data available in the input and output datasets. Access to data items is described by a range query, namely a multi-dimensional bounding box in the underlying multi-dimensional attribute space of the dataset. Only the data items whose associated coordinates fall within the multi-dimensional box are retrieved.

The processing structures of these applications also share common characteristics. Figure 1 shows high-level pseudo-code for the basic processing loop in these applications. The processing steps consist of retrieving input and output data items that intersect the range query (steps 1–2 and 4–5), mapping the coordinates of the retrieved input items to the corresponding output items (step 6), and aggregating, in some way, all the retrieved input items mapped to the same output data items (steps 7–8). Correctness of the output usually does not depend on the order in which input data items are aggregated. The mapping function, Map(ie), maps an input item (ie) to a set of output items. An intermediate data structure, referred to as an accumulator, is used to hold intermediate results during processing. For example, an accumulator can be used to keep a running sum for an averaging operation. The aggregation function, Aggregate(ie, ae), aggregates the value of an input item with the intermediate result stored in the accumulator element (ae). The output dataset from a query is usually much smaller than the input dataset, hence steps 4–8 are called the reduction phase of the processing. Accumulator elements are allocated and initialized (step 3) before the reduction phase. Another constraint is that there is...
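
The processing loop described in the abstract can be sketched in Python as follows. This is an illustrative sketch only, not the ADR implementation and not a reproduction of the paper's Figure 1. The 2-D grid of output cells, the averaging operation, and the names `Accumulator`, `in_box`, `map_item`, `aggregate`, `run_query`, and `cell_size` are all assumptions made for the example. In-memory lists stand in for the disk-resident datasets.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Accumulator:
    """Hypothetical accumulator element: running sum and count for an average."""
    total: float = 0.0
    count: int = 0


def in_box(coords, box):
    """Return True if coords lie inside the range query (a bounding box)."""
    lo, hi = box
    return all(l <= c <= h for c, l, h in zip(coords, lo, hi))


def map_item(coords, cell_size):
    """Map(ie): map an input item to a set of output items.

    Assumption: each input item maps to exactly one output grid cell.
    """
    return [tuple(int(c // cell_size) for c in coords)]


def aggregate(value, acc):
    """Aggregate(ie, ae): fold an input value into the accumulator element."""
    acc.total += value
    acc.count += 1


def run_query(input_items, box, cell_size):
    lo, hi = box
    # Allocate and initialize one accumulator per output cell that
    # intersects the range query (roughly steps 1-3 in the abstract).
    ranges = [range(int(l // cell_size), int(h // cell_size) + 1)
              for l, h in zip(lo, hi)]
    accumulators = {cell: Accumulator() for cell in product(*ranges)}

    # Reduction phase (roughly steps 4-8): retrieve, map, aggregate.
    for coords, value in input_items:
        if not in_box(coords, box):
            continue  # item falls outside the range query
        for cell in map_item(coords, cell_size):
            aggregate(value, accumulators[cell])

    # Turn accumulators into output values; skip cells that got no input.
    return {cell: acc.total / acc.count
            for cell, acc in accumulators.items() if acc.count}
```

Because addition is commutative, the result of `run_query` in this sketch does not depend on the order of `input_items`. This mirrors the abstract's remark that correctness usually does not depend on aggregation order.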

Similar resources

A High-Performance Database System for Managing Large Multi-resolution Medical Images

In this work we address the design of a database system to explore, process, and visualize very large (multiterabyte) multi-resolution image datasets, obtained from MRI, CT and ultrasound, and digitized microscopy images. The basic requirements for such a database management system include (1) support for adding and managing user-defined processing functions, (2) managing datasets stored in dis...

Object-Relational Queries into Multidimensional Databases with the Active Data Repository

As computational power and storage capacity increase, processing and analyzing large volumes of multi-dimensional datasets play an increasingly important role in many domains of scientific research. Scientific applications that make use of very large scientific datasets have several important characteristics: datasets consist of complex data and are usually multi-dimensional; applications usually ...

Query Planning for Range Queries with User-defined Aggregation on Multi-dimensional Scientific Datasets

Applications that make use of very large scientific datasets have become an increasingly important subset of scientific applications. In these applications, the datasets are often multi-dimensional, i.e., data items are associated with points in a multi-dimensional attribute space. The processing is usually highly stylized, with the basic processing steps consisting of (1) retrieval of a subset...

Optimizing Retrieval and Processing of Multi-Dimensional Scientific Datasets

Exploring and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. We have been developing the Active Data Repository (ADR), an infrastructure that integrates storage, retrieval, and processing of large multi-dimensional scientific datasets on distributed memory parallel machines with multiple disks attached to each node. In earlier work, ...

DEVise: Integrated Querying and Visual Exploration of Large Datasets (DEMO

DEVise is a data exploration system that allows users to easily develop, browse, and share visual presentations of large tabular datasets (possibly containing or referencing multi-media objects) from several sources. The DEVise framework, implemented in a tool that has been already successfully applied to a variety of real applications by a number of user groups, makes several contributions. I...

Publication date: 1999